© Springer Nature Switzerland AG 2023
Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_2

2. Introduction to R for Microbiome Data

Yinglin Xia1   and Jun Sun 1
(1)
Department of Medicine, University of Illinois Chicago, Chicago, IL, USA
 

Abstract

This chapter first introduces some useful R functions and R packages for microbiome data. Then it illustrates some specifically designed R packages for microbiome data analysis (e.g., phyloseq and microbiome). Next it briefly describes three R packages for analysis of phylogenetics (ape, phytools, and castor). Following that it introduces the BIOM format and the biomformat package and illustrates creating a microbiome dataset for longitudinal data analysis.

Keywords
saveRDS() save() save.image() readRDS() load() download.file() readr ggpubr Box plots Violin plots Density plots Histogram plots tidyverse phyloseq package phyloseq object Microbiome package CSV Mothur BIOM format ampvis2 package curatedMetagenomicData Phylogenetics Ape Phytools Castor read.table() read.delim() read.csv() read.csv2() write.table() read.csv()

In Chap. 1, we discussed the general features of QIIME 2. In Chap. 2, we first introduce some useful R functions (Sect. 2.1). Then, we introduce some useful R packages for microbiome data (Sect. 2.2). In Sect. 2.3, we illustrate some specifically designed R packages for microbiome data including the phyloseq and microbiome packages. In Sect. 2.4, we briefly describe three R packages for analysis of phylogenetics including ape, phytools, and castor. In Sects. 2.5 and 2.6, we introduce the BIOM format and the biomformat package and how to create a microbiome dataset for longitudinal data analysis. Finally, we summarize the main contents of this chapter in Sect. 2.7.

2.1 Some Useful R Functions

This section illustrates some useful R functions. Most microbiome data analyses in this book will use R software (R version 4.1.2; 2021-11-01). Some general R functions, R packages, and specifically designed R packages for microbiome data analysis are very useful. Let’s first briefly introduce these functions and packages before we move on to other chapters.

Example 2.1: Iris Data

The famous Iris data were collected by Edgar Anderson in 1935 (Anderson 1935) and first used by Fisher in 1936 (Fisher 1936) to illustrate a taxonomic problem. The dataset measures the variables in centimeters of sepal length and width, and petal length and width, respectively, for 50 flowers from each of 3 Iris species (setosa, versicolor, and virginica). The dataframe iris has 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The Iris dataset is wrapped in R package.

In this section, we use the Iris dataset to illustrate some useful R functions. In Chap. 4 (Introduction to R, RStudio, and ggplot2) of our previous book on microbiome data analysis (Xia et al. 2018), we introduced some basic R functions for data import and export, including read.table(), read.delim(), read.csv(), read.csv2(), and write.table(). Some of them such as read.csv() are used very frequently across most chapters in this book. Here, we introduce additional basic R functions that could be useful in your data analysis.

2.1.1 Save Data into R Data Format

Writing R data into txt, csv, or Excel file formats facilitates data analysis using other software, such as Excel. However, all these file formats have their own data structures rather than reserve the original R data structures, such as vectors, matrices, and data frames as well as column data types (numeric, character, or factor). The data files will be automatically compressed after the data are saved into R data formats (e.g., .rda files); thus, one benefit of saving data into R data formats can substantially reduce the size of converted files.

2.1.1.1 Use SaveRDS() to Save Single R Object

We can use saveRDS() to save single R object to the file with .rds format. The general syntax of this function is given:
saveRDS(obj, file = "name_file.rds")
where obj is the R object to save and file is used to specify the name of the file where the R object is saved to (the > below are prompts).
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> data(iris)
> # Save a singel object to a file
> saveRDS(iris, file = "iris_saved.rds")

2.1.1.2 Use Save() to Save Multiple R Objects

We can use save() to save multiple R objects to the file with .RData format. The general syntax of this function is given:
save(obj, file = "name_file.RData")
> head(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa
> # Save on object with RData format and factor
> save(iris, file = "iris.RData")
Next, let’s save each column individually as vectors.
> # Save the iris data as vectors
> Sepal.Length<-iris$Sepal.Length
> Sepal.Width<-iris$Sepal.Width
> Petal.Length<-iris$Petal.Length
> Petal.Width<-iris$Petal.Width
> Species<-iris$Species
> class(Sepal.Length)
[1] "numeric"
> class(Sepal.Width)
[1] "numeric"
> class(Petal.Length)
[1] "numeric"
> class(Petal.Width)
[1] "numeric"
> class(Species)
[1] "factor"
We can use the save() function to save all these 4 vectors and 1 factor to an .rda file. The “file” is renamed as “iris_saved.rda.”
> # Save to rda file
> save(Sepal.Length, Sepal.Width, Petal.Length,Petal.Width, Species, file = "iris_saved.rda")

2.1.1.3 Use Save.Image() to Save Workspace Image

By default, save.image() stores the workspace to a file named .RData. We can also specify the file name for saving the work space, such as to save the work space as “myworkspace” using save.image(file = “myworkspace.RData”). Actually, you may notice that each time when you close R/RStudio, you are asked whether or not to save your workspace. If you click yes, then the next time when you start R, the saved workspace (with .RData) will be loaded.

2.1.2 Read Back Data from R Data Format Files

The saved R files can be read back to R workspace using either readRDS() or load() functions.

2.1.2.1 Use ReadRDS() to Read Back Single R Object

The general syntax of this function is given:
readRDS (file = "name_file.rds")
We can rename the restored object. Where the file is used to name the file where the R object is read from.
> # Restore a saved R object using a different name
> iris_retored <- readRDS("iris_saved.rds")
> head(iris_retored,3)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa

2.1.2.2 Use Load() to Read Back Multiple Objects from R File

The syntax of this function is given:
load("name_file.RData")
The load() function will automatically load the original object if the data was saved with save(). In other words, the original object cannot be renamed when it is read back.
> # Load the RData data
> load("iris.RData")
> head(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa
> # Load the rda data
> load(file = "iris_saved.rda")
> head(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa

2.1.3 Use Download.File() to Download File from Website

The R function download.file() can be used to download file from a website, such as a webpage, an R file, a tar.gz file, etc. The simplest syntax of this function is with two arguments: download.file(url, destfile). url is a character string (or longer vector) naming the URL of the file to be downloaded; destfile is a character string (or vector) with the name where the downloaded file is saved (path with a file name). For example, to directly download the QIIME 2 files: “feature-table.qza” from the following QIIME 2 website to R or RStudio, we call the download.file().

> download.file("https://data.qiime2.org/2022.2/tutorials/exporting/feature-table.qza", "feature-table.qza")
The downloaded feature-table files can be read back to R using the qiime2R package (see Chap. 3).
> library(qiime2R)
> feature_tab<-read_qza("feature-table.qza")
> head(feature_tab)

2.2 Some Useful R Packages for Microbiome Data

There are very useful and often used R packages. In this section, we introduce and illustrate some of them.

2.2.1 Readr

Example 2.2: Iris Data, Example 2.1, Cont.

In this section, we use the famous Iris data to illustrate some useful R package. The package readr was developed by Hadley Wickham et al. (Wickham, Hester, and Francois 2018) for fast reading and writing delimited files or csv files (current version 1.4.0, March 2022). It contains the function write_delim(), write_csv(), and write_tsv() to easily export a data from R. We continue to use the Iris dataset to illustrate the readr package.

Installing and Loading readr and Data
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
# Installing
install.packages("readr")
# Loading
library("readr")
> # Load data
> data(iris)
> df <- iris
> head(df, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa
Writing Data Using the Readr Functions

Depending on data format, we can choose the write_delim(), write_csv(), or write_tsv() to export a data table from R. Of which, the write_delim() function is a general function of writing data from R; write_csv() is used to write a comma (“,”) separated values; while write_tsv() is used to write a tab separated (“\t”) values. The simplified general syntax of these functions is given as follows:

write_delim(df, file, delim = " ") # General function of writing data from R
write_csv(df, file) # Write comma (",") separated value files
write_tsv(df, file) # Write tab ("\t") separated value files
where df is a data frame to be written; file is used to specify the file name to be saved; while delim is the delimiter that is used to separate values. It must be single character.
The following commands write the Iris dataset to a txt (i.e., tsv) file.
> # Write data from R to a txt (i.e., tsv) file
> write_tsv(iris, file = "iris.txt")

The built-in R iris dataset has been exported to a tab-separated (sep = “\t”) file called iris.txt in the current working directory: setwd(“~/Documents/QIIME2R/Ch2_IntroductionToR”).

The following commands write the Iris dataset to a csv file.
> # Write data from R to a csv file
> write_csv(iris, file = "iris.csv")

2.2.2 Ggpubr

Example 2.3 Iris Data, Example 2.2, Cont.

The R package ggpubr written by Alboukadel Kassambara (alboukadel.​kassambara@gmail.​com) (Kassambara 2020-06-27), is a ggplot2-based package that provides some easy-to-use functions for creating and customizing publication ready plots (current version 0.4.0, March 2022). The ggplots package written by Hadley Wickham (2016) is an excellent and flexible visualization tool in R. However the default generated plots are not ready for publication. Furthermore, the customized ggplot syntax is opaque, which challenges the R beginners and researchers with no advanced R programming skills. The ggpubr package facilitates the creation of beautiful ggplot2-based graphs with some key features, including less opaque syntax than the wrapped ggplot2 package, providing publication-ready plots, is able to automatically add p-values and significance levels to box, bar, and line plots, facilitating arranging and annotating multiple plots on the same page, and changing graphical parameters (e.g., colors and labels). Here, we illustrate some plots that are useful for exploration of microbiome data.

Installation and Loading
There are two ways to install the ggpubr package: One is from CRAN as follows:
install.packages("ggpubr")
Or we can type the codes to install the latest version from GitHub as follows:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
After installation, we can set working directory, call the library, and load the dataset as below:
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> library(ggpubr)
> data(iris)
> df <- iris
> head(df, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1   3.5   1.4   0.2 setosa
2   4.9   3.0   1.4   0.2 setosa
3   4.7   3.2   1.3   0.2 setosa

2.2.2.1 Box Plots

The following commands create the box plots with jittered points (Fig. 2.1):
> # Figure 2.1
> # Use outline colors for groups: Species
> # Use custom color palette
> # Add jitter points and use different shapes for groups
> # The plus sign ("+") below appears in the console, meaning that
> # R is wanting to enter some additional information
> p <- ggboxplot(df, x = "Species", y = "Sepal.Length",
+   color = "Species", palette =c("#00AFBB", "#E7B800", "#FC4E07"),
+   add = "jitter", shape = "Species")
> p

A box plot with jittered points has Sapel.Length versus species of setosa, Versicolor, and verginica. An Iris dataset of three species has different Sapel.Length, verginica has the highest and setosa has the lowest.

Fig. 2.1

Box plots with jittered points using Iris dataset

The following commands add a global p-value and p-values for pairwise comparisons to the box plots with jittered points (Fig. 2.2):
> # Figure 2.2
> # Add p-values for comparing groups
> # Specify the pairwise group comparisons
> comps <- list( c("setosa", "versicolor"), c("setosa", "virginica"), c("versicolor", "virginica") )
> p + stat_compare_means(comparisons = comps)+ # Add global p-value and p-values for pairwise comparisons
+ stat_compare_means(label.y = 12)

A box plot with jittered points has Sapel. Length versus species of Setosa, Versicolor, and Verginica. It depicts the p-value and p-values for pairwise comparisons of setosa versus versicolor, setosa versus virginica, and versicolor and virginica.

Fig. 2.2

Box plots with a global p-value and p-values for pairwise comparisons

The following plots show the significance levels for pairwise comparisons (Fig. 2.3):
> # Figure 2.3
> # Specify the comparisons of interest
> comps <- list( c("setosa", "versicolor"), c("setosa", "virginica"), c("versicolor", "virginica") )
> p + stat_compare_means(comparisons = comps,label = "p.signif")+ # Show the significance levels
+ stat_compare_means(label.y = 12)         # Add global p-value

A box plot with jittered points has Sapel.Length versus species of Setosa, Versicolor, and Verginica. It depicts the p-value and the significance levels for pairwise comparisons of setosa versus versicolor, setosa versus virginica, and versicolor and virginica.

Fig. 2.3

Box plots with a global p-value and the significance levels for pairwise comparisons

2.2.2.2 Violin Plots

The following commands create the violin plots with box plots inside and p-values (Fig. 2.4):
> # Figure 2.4
> # Use outline colors for groups: Species
> # Use custom color palette
> # Add jitter points and use different shapes for groups
> # Add boxplot with white fill color
> # Specify the comparisons you want
> # Add p-values for comparing groups
> ggviolin(df, x = "Species", y = "Sepal.Length", fill = "Species",
+   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
+   add = "boxplot", add.params = list(fill = "white"))+
+ stat_compare_means(comparisons = comps)+ # Add pairwise comparisons p-value
+ stat_compare_means(label.y = 12)

A violin plot with box plots has Sapel.Length versus species of setosa, versicolor, and verginica. It depicts that inside, adding the p-value and the p-values for pairwise comparisons of setosa versus versicolor, setosa versus virginica, and versicolor and virginica.

Fig. 2.4

Violin plots with box plots inside adding a global p-value and p-values for pairwise comparisons

The following commands create the violin plots with box plots inside and label significance levels (Fig. 2.5):
> # Figure 2.5
> ggviolin(df, x = "Species", y = "Sepal.Length", fill = "Species",
+   palette = c("#00AFBB", "#E7B800", "#FC4E07"),
+   add = "boxplot", add.params = list(fill = "white"))+
+ stat_compare_means(comparisons = comps, label = "p.signif")+ # Add significance levels
+ stat_compare_means(label.y = 12) # Add global the p-value

A violin plot with box plots has Sapel.Length versus species of setosa, versicolor, and verginica. It depicts that by adding the p-value and the significance levels for pairwise comparisons of setosa versus versicolor, setosa versus virginica, and versicolor and virginica.

Fig. 2.5

Violin plots with box plots inside adding a global p-value and the significance levels for pairwise comparisons

2.2.2.3 Density Plots

The following commands plot the density of Sepal.Width with mean lines and marginal rug and use different colors to label the three species (Fig. 2.6):
> # Figure 2.6
> # Density plot with mean lines and marginal rug
> # Change outline and fill different colors for groups ("Species")
> # Use custom palette
> ggdensity(df, x = "Sepal.Width",
+   add = "mean", rug = TRUE,
+   color = "Species", fill = "Species",
+   palette = c("#00AFBB", "#E7B800", "#FC4E07"))

A plot depicts the density versus Sepal. Width with mean lines. It has a marginal rug that differentiates the colors of species.

Fig. 2.6

Density plot of Sepal.Width with mean lines and marginal rug differentiating colors of species

2.2.2.4 Histogram Plots

The following codes plot the histogram of Sepal.Width with mean lines and marginal rug and use different colors to label the three species (Fig. 2.7):
> # Figure 2.7
> # Change outline and fill different colors for groups ("Species")
> # Use custom color palette
> gghistogram(df, x = "Sepal.Width",
+   add = "mean", rug = TRUE,
+   color = "Species", fill = "Species",
+   bins = 20,
+   palette = c("#00AFBB", "#E7B800","#FC4E07"))

A histogram plot depicts the count versus Sepal. Width with mean lines. It has a marginal rug that differentiates the colors of species.

Fig. 2.7

Histogram plot of Sepal.Width with mean lines and marginal rug differentiating colors of species

2.2.3 Tidyverse

The tidyverse package (a single “meta” package) was developed by Wickham et al. (Wickham et al. 2019) to collect R packages that share a high-level design philosophy and low-level grammar and data structures, enabling the R packages in the set work in harmony (current version 1.3.1, March 2022). The tidyverse is not the tools for statistical modelling or communication, instead its goal is to facilitate data import, tidying, manipulation, visualization, and programming. For example, it allows users to install all tidyverse packages with a single command. We can either install it from CRAN or via GitHub.

# Install from CRAN
> install.packages("tidyverse")
> # Or the development version from GitHub
> # install.packages("devtools")
> devtools::install_github("tidyverse/tidyverse")
> library(tidyverse)
When library(tidyverse) is called, the core tidyverse packages will be loaded:
── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.3 ✓ dplyr 1.0.2
✓ tidyr 1.1.0 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
The conflicts with other previously loaded packages are also summarized:
── tidyverse_conflicts() ──
x dplyr::filter () masks stats::filter ()
x dplyr::lag() masks stats::lag()

The tidyverse consists of four programs: (1) import, (2) tidy, (3) understand (transformation, visualize, and model), and (4) communicate. The programming works out this way (Wickham et al. 2019): starts with data import➔ tidy data➔ data transformation➔ understanding data: visualization and modellingcommunication.

2.3 Specifically Designed R Packages for Microbiome Data

In this section, we introduce some specifically designed R packages for microbiome data analysis.

2.3.1 Phyloseq

The phyloseq package was developed to handle and analyze the high-throughput microbiome census data. The package was written by McMurdie and Holmes (McMurdie and Holmes 2013) with contributions from Gregory Jordan and Scott Chamberlain (McMurdie and Holmes 2022) (current version 1.36.0, March 2022). The phyloseq classes and tools facilitate the import, storage, analysis, and graphical display of microbiome census data. The phyloseq package mainly wraps statistical functions from vegan package (F. Jari Oksanen et al. 2018) and ggplot2 graphical package (Wickham 2016). Among the classes and tools, the phyloseq object and the operations on this object are very useful. The object and some functions of data operations have been adopted in other R microbiome packages for data processing and analysis such as in packages curatedMetagenomicData (Pasolli et al. 2017) and microbiome (Lahti 2017).

Example 2.4: Diet Swap Between Rural and Western Populations

O’Keefe et al. in 2015 (S.J.D. O’Keefe et al. 2015) reported a 2-week diet swap study between western (USA) and traditional (rural Africa) diets to investigate whether the higher rates of colon cancer are associated with consumption of higher animal protein and fat, and lower fiber. The data is available for download from Data Dryad (S.J.D. O’Keefe 2016).

The dataset contains a HITChip data matrix and a sample metadata. The HITChip data matrix is a csv file which contains 222 samples from this study, providing the absolute HITChip phylogenetic microarray signal estimate for 130 taxa (genus-like groups). The sample metadata is also a CSV file, containing various measurement variables, in which SampleID is the unique sample identified corresponding to samples in the HITChip data matrix; subject is the subject identifier (some subjects have multiple time points); bmi is the standard body-mass classification (defining underweight, bmi <18.5; lean, 18.5≤bmi <25; overweight, 25≤bmi <30; obese, 30≤bmi <35; severe obese, 35≤bmi <40; morbid obese, 40≤bmi <45; superobese, bmi >45); sex (male/female); nationality is African American (AAM) and Native African (AFR); timepoint.total refers to the time point in the overall dataset ranging from 1 to 6; timepoint.group is the time point (1/2) within the group and group the sample treatment group representing Dietary intervention (DI)/Home environment (HE)/Solid stool pre-colonoscopy (ED).

2.3.1.1 Create a phyloseq Object

Microbiome abundance and taxonomic datasets are generated by next-generation sequencing (NGS) techniques. Currently 16S rRNA gene sequencing and whole-metagenome shotgun sequencing are the two most commonly used sequencing approaches (Xia and Sun 2022).

After taking initial quality control steps to account for errors in the sequencing process, the data generated by microbiome community sequencing typically are organized into large matrices where usually the columns represent samples, and the rows contain observed counts of clustered sequences commonly known as operational taxonomic units (OTUs/ASVs) that represent types of bacteria or other features. We often call these tables as OTU/ASVs tables or feature tables. We will introduce how to build feature table from raw sequencing reads in Chap. 4, assign taxonomy, building phylogenetic tree in Chap. 5, and cluster OTUs in Chap. 6. For now, we focus on how to create a phyloseq object in the phyloseq package.

One important concept and procedure in phyloseq package is the phyloseq object. A full phyloseq object has four data components: otu_table, sample_data, tax_table, and phylo. At least two components are needed to analyze the data of phyloseq object. The phyloseq object is created via the phyloseq() function. When the phyloseq object is created, the validity and coherency between data components is checked by the phyloseq-class constructor, which is invoked internally by the importers.

Here, we use diet swap dataset to illustrate how to create a phyloseq object and some operations on phyloseq object. First, we need to install this package. Type the following commands in R (version “3.6” or later) or RStudio (the > below are prompts):
>if (!requireNamespace("BiocManager", quietly = TRUE))
> install.packages("BiocManager")
>BiocManager::install("phyloseq")
> packageVersion("phyloseq")
[1] ‘1.36.0’
In the directory “~/Documents/QIIME2R/Ch2_IntroductionToR”, we store three datasets: (1) otu-table with Samples (rows) by OTUs (columns) format called “otu_table_dietswap.csv”; (2) taxa-table with OTUs(rows) by Taxa (columns) format called “tax_table_dietswap.csv”; and (3) sample metadata called “metadata_dietswap.csv”, which is a data frame.
  • Step 1: Set the R working directory to the folder that these datasets are stored and load these datasets.

> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> otu_tab <- read.csv("otu_table_dietswap.csv", row.names = 1)
> tax_tab <- read.csv("tax_table_dietswap.csv",row.names = 1)
> meta_tab <- read.csv("metadata_dietswap.csv",row.names = 1)
There are two ways to create a phyloseq object: One is to use the phyloseq() function and another is to use the merge_phyloseq () function. The arguments of the phyloseq () are an otu_table and any unordered list of valid phyloseq components: sample_data, tax_table, phylo, or XStringSet. The tip labels of a phylo-object (tree) and the sequence names of an XStringSet object must match the OTU names of the otu_table. The merge_phyloseq() is most useful when we already created a phyloseq object and want to add a separately imported components to this already-created phyloseq object. The merge_phyloseq() can take any number of phyloseq objects and/or phyloseq components to combine them into one larger phyloseq object.
  • Step 2: Check and make sure that the datasets have been correctly loaded.

We can use the head() function to check whether we import the correct datasets including the sample IDs, data format, and taxonomy ranks.
> head(otu_tab,3)
> head(tax_tab,3)
> head(meta_tab,3)
  • Step 3: Check and make sure that the classes of datasets are correct.

One important thing when creating the phyloseq object is to make sure that both otu_table and tax_table components are matrices and sample_data is a data frame. The component of otu_table works on any numeric matrix. When we use the phyloseq () to create a phyloseq object, we must also specify if the taxa are rows or columns based on the loaded otu dataset. The component of tax_table works on any character matrix. The row names of tax_table must match the OTU names (taxa_names) of the otu_table if we combine these two tables with a phyloseq object. The component of sample_data works on any dataframe. The row names of sample_data must match the sample names in the otu_table if we combine them as a phyloseq object.
> class(otu_tab)
[1] "data.frame"
> class(tax_tab)
[1] "data.frame"
> class(meta_tab)
[1] "data.frame"

Since the above codes show that both otu_table and tax_table components are data frames, we need to convert them into matrices.

> otumat<-as.matrix(otu_tab)
> taxmat<-as.matrix(tax_tab)
> class(otumat)
[1] "matrix"
> class(taxmat)
[1] "matrix"
  • Step 4: Create phyloseq object using either phyloseq () or merge_phyloseq().

Now we can create a phyloseq object. We first use the phyloseq() function as below. It is important to make sure to specify whether taxa_are_rows = TRUE or FALSE based on the loaded otu matrix. Here, taxa_are_rows = FALSE.

> library(phyloseq)
> otu<-otu_table(otumat,taxa_are_rows = FALSE)
> tax<-tax_table(taxmat)
> sam<-sample_data(meta_tab)
> physeq = phyloseq(otu, tax, sam)
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 37 taxa and 222 samples ]
sample_data() Sample Data: [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 37 taxa by 3 taxonomic ranks ]

The above created phyloseq object shows that OTU Table has 37 taxa and 222 samples, Sample Data has 222 samples by 7 sample variables, and Taxonomy Table has 37 taxa by 3 taxonomic ranks. One mistake that sometime happens is that the order or names of sample IDs in OTU Table and Sample Data are not exactly matched. We can use the sample_names() function to check.

> sample_names(otu)
[1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5"
[6] "Sample-6" "Sample-7" "Sample-8" "Sample-9" "Sample-10"
[11] "Sample-11" "Sample-12" "Sample-13" "Sample-14" "Sample-15"
------
> sample_names(sam)
[1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5"
[6] "Sample-6" "Sample-7" "Sample-8" "Sample-9" "Sample-10"
[11] "Sample-11" "Sample-12" "Sample-13" "Sample-14" "Sample-15"
------
Now let’s illustrate how to use the merge_phyloseq() function to add the new data components to the phyloseq object we already have. Assume that we first use the phyloseq() to create a phyloseq object physeq1 by otu table and sample data.
> otu<-otu_table(otumat,taxa_are_rows = FALSE)
> tax<-tax_table(taxmat)
> sam<-sample_data(meta_tab)
> physeq1 = phyloseq(otu,sam)
> physeq1
phyloseq-class experiment-level object
otu_table() OTU Table:   [ 130 taxa and 222 samples ]
sample_data() Sample Data:   [ 222 samples by 7 sample variables ]
Now we can use the merge_phyloseq() function to merge the taxonomy data into the current phyloseq object physeq1:
> physeq2 = merge_phyloseq(physeq1,tax)
> physeq2
phyloseq-class experiment-level object
otu_table() OTU Table:   [ 37 taxa and 222 samples ]
sample_data() Sample Data:   [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table:   [ 37 taxa by 3 taxonomic ranks ]
> identical(physeq, physeq2)
[1] TRUE
The above prints show that the results using the functions phyloseq() and merge_phyloseq() are identical.
  • Step 5: Create phyloseq object with adding a random phylogenetic tree component.

We can use the ape package to create a random phylogenetic tree and add it to the dataset (current version 5.6.2, March 2022; see Sect. 2.4.1). For more information on phylogenetic tree, the readers are referred to Sect. 2.4 and Chap. 3 (Sect. 3.​1.​4). To ensure the tree be correctly created, make sure its tip labels match the names of OTU_table component.
> library("ape")
> random_tree = ape::rtree(ntaxa(physeq), rooted=TRUE, tip.label=taxa_names(physeq))
> plot(random_tree)

Now merge the tree data to the phyloseq object we already have by using the merge_phyloseq(), or use the phyloseq () to build it again from scratch. The results should be identical.

We first use the merge_phyloseq() to merge the tree data with the phyloseq object “physeq”:
> physeq3 = merge_phyloseq(physeq, random_tree)
> physeq3
phyloseq-class experiment-level object
otu_table() OTU Table:   [ 37 taxa and 222 samples ]
sample_data() Sample Data:   [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table:   [ 37 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree:   [ 37 tips and 36 internal nodes ]
We then use the phyloseq() to build it again from scratch:
> library(phyloseq)
> physeq4 = phyloseq(otu, tax, sam,random_tree)
> physeq4
phyloseq-class experiment-level object
otu_table() OTU Table:   [ 37 taxa and 222 samples ]
sample_data() Sample Data:   [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table:   [ 37 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree:   [ 37 tips and 36 internal nodes ]
It is true that they are identical.
> identical(physeq3, physeq4)
[1] TRUE

2.3.1.2 Merge Samples in a phyloseq Object

Example 2.5: Diet Swap Between Rural and Western Populations, Example 2.4, Cont.

The phyloseq package supports two completely different categories of merging data objects: merging two data objects and merging data objects based upon a taxonomic or sample variable, i.e., merging the OTUs or samples in a phyloseq object. Merging OTU or sample indices based on variables in the data is useful to reduce noise or excess features in an analysis or graphic. It requires that the examples to be merged in a dataset are from the same environment and the OTUs to be merged are from the same taxonomic genera. The usual approach is to use a table-join, and hence the non-matching keys are omitted in the result, where keys are Sample IDs or Taxa IDs, respectively.

Here, we illustrate how to merge sample within a phyloseq object via the merge_samples() function. This function is a very useful tool to test an analysis effect by removing the individual effects between replicates or between samples from a particular explanatory variable. By performing the merge_samples() function, the abundance values of merged samples are summed.

First, remove empty samples, i.e., remove unobserved OTUs (sum 0 across all samples). In this case, all samples have sum of OTUs > 0, so the object physeq3a is identical to the object physeq3.

> physeq3a = physeq3
> physeq3a = prune_taxa(taxa_sums(physeq3) > 0, physeq3)
> physeq3a
phyloseq-class experiment-level object
otu_table() OTU Table:   [ 37 taxa and 222 samples ]
sample_data() Sample Data:   [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table:   [ 37 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree:   [ 37 tips and 36 internal nodes ]
Then, add a new sample_data variable to the dataset.
> # Check the names of sample
> sample_variables(physeq3a)
[1] "subject"   "bmi"   "sex"   "nationality"
[5] "timepoint.total" "timepoint.group" "group"
> # Define obese variable
> obese = c("Obese")
> sample_data(physeq3a)$obese <- get_variable(physeq3a, "bmi") %in% obese
> sample_data(physeq3a)$obese
> sampledata= sample_data(physeq3a)
> head(sampledata,3)
Sample Data: [3 samples by 8 sample variables]:
  subject bmi sex nationality timepoint.total timepoint.group
Sample-1  byn  Obese Male   AAM   4   1
Sample-2  nms  Lean Male   AFR   2   1
Sample-3  olt Overweight Male   AFR   2   1
group obese
Sample-1  DI TRUE
Sample-2  HE FALSE
Sample-3  HE FALSE
print(sampledata)
> options(width=100)
> print(sampledata,3)
subject bmi sex nationality timepoint.total timepoint.group group obese
Sample-1 byn Obese Male AAM 4 1 DI TRUE
Sample-2 nms Lean Male AFR 2 1 HE FALSE
Sample-3 olt Overweight Male AFR 2 1 HE FALSE
Sample-4 pku Obese Female AFR 2 1 HE TRUE
Sample-5 qjy Overweight Female AFR 2 1 HE FALSE
Sample-6 riv Obese Female AFR 2 1 HE TRUE
------
> # Merge obese variable
> mergedphyseq3a = merge_samples(physeq3a, "obese")
> mergedphyseq3a
phyloseq-class experiment-level object
otu_table() OTU Table: [ 37 taxa and 2 samples ]
sample_data() Sample Data: [ 2 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 37 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 37 tips and 36 internal nodes ]
> head(sample_names(sampledata))
[1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"
> head(sample_names(physeq3a))
[1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"
> sample_names(mergedphyseq3a)
[1] "FALSE" "TRUE"
The following commands define top 10 taxa:
> OTUnames10 = names(sort(taxa_sums(physeq3a), TRUE)[1:10])
> OTUnames10
[1] "Akkermansia" "Dialister" "Bifidobacterium" "Collinsella" "Enterococcus"
[6] "Veillonella" "Fusobacteria" "Serratia" "Campylobacter" "Oceanospirillum"
The following commands create a phyloseq object for the top 10 taxa:
> physeq3a10 = prune_taxa(OTUnames10, physeq3a)
> physeq3a10
phyloseq-class experiment-level object
otu_table() OTU Table: [ 10 taxa and 222 samples ]
sample_data() Sample Data: [ 222 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 10 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 10 tips and 9 internal nodes ]
The following commands create a subset of phyloseq object for the top 10 taxa:
> mergedphyseq3a10 = prune_taxa(OTUnames10, mergedphyseq3a)
> mergedphyseq3a10
phyloseq-class experiment-level object
otu_table() OTU Table: [ 10 taxa and 2 samples ]
sample_data() Sample Data: [ 2 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 10 taxa by 3 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 10 tips and 9 internal nodes ]
The following commands create a subset of sample for the obese subjects:
> obese_samples = sample_names(subset(sample_data(physeq3a), bmi=="Obese"))
> print(obese_samples)
[1] "Sample-1" "Sample-4" "Sample-6" "Sample-7" "Sample-12" "Sample-14" "Sample-15"
[8] "Sample-20" "Sample-22" "Sample-23" "Sample-28" "Sample-30" "Sample-31" "Sample-36"
[15] "Sample-38" "Sample-40" "Sample-45" "Sample-47" "Sample-49" "Sample-52" "Sample-55"
[22] "Sample-59" "Sample-61" "Sample-65" "Sample-67" "Sample-71" "Sample-72" "Sample-73"
[29] "Sample-76" "Sample-78" "Sample-81" "Sample-85" "Sample-86" "Sample-88" "Sample-92"
[36] "Sample-93" "Sample-95" "Sample-98" "Sample-99" "Sample-106" "Sample-113" "Sample-115"
[43] "Sample-123" "Sample-125" "Sample-127" "Sample-134" "Sample-136" "Sample-144" "Sample-146"
[50] "Sample-148" "Sample-150" "Sample-151" "Sample-152" "Sample-153" "Sample-154" "Sample-155"
[57] "Sample-156" "Sample-160" "Sample-161" "Sample-162" "Sample-163" "Sample-167" "Sample-168"
[64] "Sample-169" "Sample-170" "Sample-173" "Sample-174" "Sample-177" "Sample-178" "Sample-182"
[71] "Sample-183" "Sample-184" "Sample-187" "Sample-188" "Sample-190" "Sample-191" "Sample-193"
[78] "Sample-194" "Sample-196" "Sample-198" "Sample-200" "Sample-202" "Sample-204" "Sample-206"
[85] "Sample-209" "Sample-211" "Sample-212" "Sample-217" "Sample-219" "Sample-220"
The following commands find the OTU table for top 10 taxa in the obese subjects. We can see there are 10 taxa in 90 samples in the OTU table.
> otu_table(physeq3a10)[obese_samples, ]
OTU Table: [10 taxa and 90 samples]
taxa are columns
Campylobacter Serratia Fusobacteria Veillonella Akkermansia Enterococcus Collinsella
Sample-1 302.8297 138.25918 337.9723 563.9701 1229.3921 269.0209 247.3910
Sample-4 313.0234 259.58027 350.2741 356.2538 16246.2164 373.2883 2987.0418
Sample-6 310.9208 655.30520 583.4619 538.3110 1092.0800 746.3891 1698.0409
------
We then sum up the reads over columns and rows for downstream analyses.
> colSums(otu_table(physeq3a10)[obese_samples, ]) ",]# otu_table has sample by otu format
Bifidobacterium Dialister Fusobacteria Collinsella Veillonella Campylobacter
484666 414137 51401 78701 55003 28298
Enterococcus Oceanospirillum Serratia Akkermansia
34466 33171 47480 436003
> rowSums(otu_table(physeq3a10)[obese_samples, ])
Sample-1 Sample-4 Sample-6 Sample-7 Sample-12 Sample-14 Sample-15 Sample-20 Sample-22
6420.736 53822.366 9784.471 23309.503 16467.630 63959.029 7991.240 9219.114 53930.128
------

2.3.1.3 Merge Two phyloseq Objects

In Sect. 2.3.1.1, we Illustrate how to create a phyloseq object using the function merge_phyloseq (). The following commands show how to extract components from an example dataset, and then build them back up to the original form using the merge_phyloseq ().
> otu = otu_table(physeq3)
> tax = tax_table(physeq3)
> sam = sample_data(physeq3)
> tree = phy_tree(physeq3)
> otutax = phyloseq(otu, tax)
> otutax
phyloseq-class experiment-level object
otu_table() OTU Table: [ 37 taxa and 222 samples ]
tax_table() Taxonomy Table: [ 37 taxa by 3 taxonomic ranks ]
> physeq3b = merge_phyloseq(otutax, sam, tree)
> identical(physeq3b, physeq3)
[1] TRUE

The new otutax object only has the OTU table and taxonomy table. When we compare the physeq3b with the original physeq3 object, they are confirmed being identical, where the physeq3b is the object that is built up by the original physeq3 object by using the function merge_phyloseq(). The arguments to the merge_phyloseq() are a mixture of multi-component (otutax) and single-component objects.

2.3.2 Microbiome

The microbiome package was developed by Leo Lahti et al. (Lahti 2017). The package utilizes phyloseq object and wraps some functions of vegan (Jari Oksanen et al. 2019) and phyloseq (McMurdie and Holmes 2013) packages with less obscure syntax (current version 1.14.0, March 2022). It also uses tools from a number of other R extensions, including ggplot2 (Wickham 2016), dplyr (Hadley Wickham et al. 2020), and tidyr (Wickham 2020). Here, we introduce some useful data processing functions in this package.

The microbiome package is built on the phyloseq objects and extends some functions of the phyloseq package in order to facilitate manipulation and processing microbiome datasets, including subsetting, aggregating, and filtering. To install microbiome package of Bioconductor development version in R, type the following R commands in R Console or RStudio Source editor panel.

> library(BiocManager)
> if (!requireNamespace("BiocManager", quietly=TRUE))
+ install.packages("BiocManager")
> BiocManager::install(version = "devel")
> BiocManager::install("microbiome")

2.3.2.1 Create a phyloseq Object

Example 2.6: Global Patterns Datasets

Caporaso et al. (2011)(Caporaso et al. 2011) published a study to compare the microbial communities from 25 environmental samples and three known “mock communities” – a total of 9 sample types – at a depth averaging 3.1 million reads per sample. This study demonstrated excellent consistency in taxonomic recovery and diversity patterns that were previously reported in many other published studies. It also demonstrated that 2000 Illumina single-end reads are sufficient to recapture the same relationships among samples that are observed with the full dataset.

These datasets have been used to illustrate the R programs in the other software including the phyloseq package. We store three data in the directory “~/Documents/QIIME2R/Ch2_IntroductionToR”: (1) “otu_table_GlobalPatterns.csv”, (2) “tax_table_GlobalPatterns.csv”, and (3) “metadata_GlobalPatterns.csv”.

The following commands load the package and example data in R.
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> otu_tab <- read.csv("otu_table_GlobalPatterns.csv", row.names = 1)
> tax_tab <- read.csv("tax_table_GlobalPatterns.csv",row.names = 1)
> meta_tab <- read.csv("metadata_GlobalPatterns.csv",row.names = 1)
> otumat<-as.matrix(otu_tab)
> taxmat<-as.matrix(tax_tab)
> library(phyloseq)
> otu<-otu_table(otumat,taxa_are_rows = TRUE)
> tax<-tax_table(taxmat)
> sam<-sample_data(meta_tab)
> physeq = phyloseq(otu, tax, sam)
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
The GlobalPatterns datasets are included in the phyloseq package. Below, we illustrate how to directly download these datasets from the phyloseq package.
> library(microbiome)
> library(phyloseq)
> library(knitr)
> data(GlobalPatterns)
> # Rename GlobalPatterns data (which is a phyloseq object)
> physeq <- GlobalPatterns
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

The microbiome package has some very useful functions for processing microbiome data. We introduce them below.

2.3.2.2 Summarize the Contents of phyloseq Object

> summarize_phyloseq(physeq)

After performing the summarize_phyloseq() function, we obtain the information of the “physeq” object, including whether it is compositional, minimum, maximum, total, average, and median number of reads, sparsity, whether any OTU sum to 1, number of singletons, percent of OTUs that are singletons (i.e., exactly one read detected across all samples), and number of sample variables. In this case, we have 7 sample variables: X.SampleID, Primer, Final_Barcode, Barcode_truncated_plus_T, Barcode_full_length, SampleType, and Description.

2.3.2.3 Retrieve the Data Elements of phyloseq Object

A phyloseq object contains OTU table (taxa abundances), sample metadata, taxonomy table (mapping between OTUs and higher-level taxonomic classifications), and phylogenetic tree (relations between the taxa). Some of these are optional.
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
However, phyloseq is a type “S4” object. The information of object of type “S4” is not directly extractable via the basic R functions, such as head() to subset the data and names() to find the names of data. To extract the information, we need to find out names of slots in the object, and then process at the slot names. For example, we can subset the otu table and sample data names using the following R commands.
slotNames(physeq)
[1] "otu_table" "tax_table" "sam_data" "phy_tree" "refseq"
> head(physeq@otu_table,3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0 0 0 0
522457 0 0 0 0 0 0 0 0 0 0
951 0 0 0 0 0 0 1 0 0 0
------
names(physeq@sam_data)
[1] "X.SampleID" "Primer" "Final_Barcode"
[4] "Barcode_truncated_plus_T" "Barcode_full_length" "SampleType"
[7] "Description"

The following functions in microbiome package facilitate this kind of processing and are very useful for downstream statistical analysis using the microbiome and other R packages, such as retrieving the data elements from phyloseq object.

First, let’s use the abundances() function to retrieve absolute and relative taxonomic abundances from OTU table.
# Absolute abundances
> head(otu_abs<- abundances(physeq),3)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0 0 0 0
522457 0 0 0 0 0 0 0 0 0 0
951 0 0 0 0 0 0 1 0 0 0
------
> # Relative abundances
> head(otu_rel <- abundances(physeq, "compositional"),3)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0.000e+00 0 0 0
522457 0 0 0 0 0 0 0.000e+00 0 0 0
951 0 0 0 0 0 0 2.305e-06 0 0 0
------
Then, we use the readcount () function to retrieve the total read counts.
> read_tot <- readcount(physeq)
> head(read_tot)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr
864077 1135457 697509 1543451 2076476 718943

Next, we use the tax_table() function to retrieve the taxonomy table.

> #Taxonomy table
> tax <- tax_table(physeq)
> head((tax),3)
Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
Kingdom Phylum Class Order Family
549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales" "Sulfolobaceae"
Genus Species
549322 NA NA
522457 NA NA
951 "Sulfolobus" "Sulfolobusacidocaldarius"
Now, we use the meta() function to retrieve the metadata which is a data.frame.
> # Pick metadata as data.frame:
> meta <- meta(physeq)
> head((meta),3)
X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
CL3 CL3 ILBC_01 AACGCA TGCGTT
CC1 CC1 ILBC_02 AACTCG CGAGTT
SV1 SV1 ILBC_03 AACTGT ACAGTT
Barcode_full_length SampleType Description
CL3 CTAGCGTGCGT Soil Calhoun South Carolina Pine soil, pH 4.9
CC1 CATCGACGAGT Soil Cedar Creek Minnesota, grassland, pH 6.1
SV1 GTACGCACAGT Soil Sevilleta new Mexico, desert scrub, pH 8.3

Finally, we can melt the phyloseq data as a data frame table for easier plotting and downstream statistical analysis.

> # Melt phyloseq data as a data frame table
> df <- psmelt(physeq)
> kable(head(df,4))
| |OTU |Sample | Abundance|X.SampleID |Primer |Final_Barcode |Barcode_truncated_plus_T |Barcode_full_length |SampleType |Description |Kingdom |Phylum |Class |Order |Family |Genus |Species |
|:------|:------|:--------|---------:|:----------|:-------|:-------------|:------------------------|:-------------------|:------------------|:-------------------------------------------|:--------|:-------------|:----------------|:-------------|:-----------|:--------------|:-------|
|406582 |549656 |AQC4cm | 1177685|AQC4cm |ILBC_17 |ACAGCT |AGCTGT |CAAGCTAGCTG |Freshwater (creek) |Allequash Creek, 3-4 cm depth |Bacteria |Cyanobacteria |Chloroplast |Stramenopiles |NA |NA |NA |
|241435 |279599 |LMEpi24M | 914209|LMEpi24M |ILBC_13 |ACACTG |CAGTGT |CATGAACAGTG |Freshwater |Lake Mendota Minnesota, 24 meter epilimnion |Bacteria |Cyanobacteria |Nostocophycideae |Nostocales |Nostocaceae |Dolichospermum |NA |
|406580 |549656 |AQC7cm | 711043|AQC7cm |ILBC_18 |ACAGTG |CACTGT |ATGAAGCACTG |Freshwater (creek) |Allequash Creek, 6-7 cm depth |Bacteria |Cyanobacteria |Chloroplast |Stramenopiles |NA |NA |NA |
|406574 |549656 |AQC1cm | 554198|AQC1cm |ILBC_16 |ACAGCA |TGCTGT |GACCACTGCTG |Freshwater (creek) |Allequash Creek, 0-1cm depth |Bacteria |Cyanobacteria |Chloroplast |Stramenopiles |NA |NA |NA |

2.3.2.4 Operate on Taxa

We can use the ntaxa() function to obtain the number of taxa and the most abundant taxa using the top_taxa() function.
> # Number of taxa
> num_tax <- ntaxa(physeq)
> num_tax
[1] 19216
> # Most abundant taxa
> top10tax <- top_taxa(physeq, n = 10)
> top10tax
[1] "549656" "331820" "279599" "360229" "317182" "94166" "158660" "329744" "550960" "189047"
Use the rank_names () function to get the names of taxa.
> ranks <- rank_names(physeq) # Taxonomic levels
> ranks
[1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
Use the taxa () function to get the names of taxa.
> taxa <- taxa(physeq) # Taxa names at the analyzed level
> head(taxa)
[1] "549322" "522457" "951" "244423" "586076" "246140"
Use the subset_taxa() function to subset taxa for analysis.
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
> # Subset taxa
> sub_bac <- subset_taxa(physeq, Phylum == "Bacteroidetes")
> sub_bac
phyloseq-class experiment-level object
otu_table() OTU Table: [ 2382 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 2382 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 2382 tips and 2381 internal nodes ]
Use the sample_sums() function to obtain the total OTU abundance in each sample.
> # Total OTU abundance in each sample
> sam_sum <- sample_sums(physeq)
> head(sam_sum,4)
CL3 CC1 SV1 M31Fcsw
864077 1135457 697509 1543451
We can prune (select) taxa using the map_levels() function. This function is used to map taxa between hierarchy levels. The syntax is map_levels(taxa = NULL, from, to, data), where taxa specifies taxa to convert, if NULL then considering all taxa in the tax.table; from specifies convert from taxonomic level; to specifies convert to taxonomic level; and data is either a phyloseq object or its taxonomy table-class.
> # List of Genera in the Bacteroideted Phylum
> #Prune (select) taxa:
> # List of Genera in the Bacteroideted Phylum
> tax_map <- map_levels(“Bacteroidetes”, "Phylum", "Genus", physeq)
> head(tax_map,3)
$Bacteroidetes
[1] "Balneola" "Thermonema"
[3] "Lutimonas" "Kordia"
[5] "Zhouia" "Capnocytophaga"
------
> # Convert between taxonomic levels (here from Genus (Akkermansia)
> # to Phylum (Verrucomicrobia):
> tax_map2 <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(physeq))
> print(tax_map2)
[1] "Verrucomicrobia"
One useful function is the prune_taxa(), which is often used to prune(select) taxa with positive sum across samples.
> # Taxa with positive sum across samples
> prune <- prune_taxa(taxa_sums(physeq) > 0, physeq)

In microbiome data processing, we sometimes want to merge the desired taxa into “Other” category. The function merge_taxa2() is used to place certain OTUs or other groups into an “other” category. One syntax is given: merge_taxa2(x, taxa = NULL, pattern = NULL, name = “Merged”). The merge_taxa2 () function differs from phyloseq::merge_taxa () lies its last two arguments: here, we can specify the name of the new merged group, and we can specify a common pattern in the name to merge. Here, we merge all Bacteroides groups into a single group named Bacteroides.

> physeq2 <- merge_taxa2(physeq, pattern = "^Bacteroides", name = "Bacteroides")
> head(taxa_bac <- tax_table(physeq2),3)
Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
Kingdom Phylum Class Order Family
549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales" "Sulfolobaceae"
Genus Species
549322 NA NA
522457 NA NA
951 "Sulfolobus" "Sulfolobusacidocaldarius"
> head(taxa<- tax_table(physeq),3)
Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
Kingdom Phylum Class Order Family
549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales" "Sulfolobaceae"
Genus Species
549322 NA NA
522457 NA NA
951 "Sulfolobus" "Sulfolobusacidocaldarius"
We can filter taxa based on user-specified function values (e.g., mean and variance).
> filt<- filter_taxa(physeq, function(x) mean(x) > 0.5, TRUE)
> filt2<- filter_taxa(physeq, function(x) var(x) > 0.001, TRUE)
We can pick a unique taxonomy level.
> # List unique phylum-level groups
> head(get_taxa_unique(physeq, "Phylum"))
[1] "Crenarchaeota" "Euryarchaeota" "Actinobacteria" "Spirochaetes" "MVP-15" "Proteobacteria"
We can also pick the taxa abundances for a given sample.
> # Pick the taxa abundances for a given sample
> sample1 <- sample_names(physeq)[[1]]
> sample1
[1] "CL3"
Below we pick abundances for a particular taxon (sample1).
> # Pick abundances for a particular taxon
> tax_abund <- abundances(physeq)[, sample1]
> head(tax_abund)
549322 522457 951 244423 586076 246140
0 0 0 0 0 0

2.3.2.5 Aggregate Taxa to Higher Taxonomic Levels

One operation on taxa (aggregating taxa to higher taxonomic levels) is particularly useful. When the phylogenetic tree is missing, the aggregate_taxa () function can be used to aggregate taxa.

> physeq3 <- aggregate_taxa(physeq, "Phylum")
> head(Phylum <- tax_table(physeq3),3)
Taxonomy Table: [3 taxa by 3 taxonomic ranks]:
Kingdom Phylum unique
Crenarchaeota "Archaea" "Crenarchaeota" "Crenarchaeota"
Euryarchaeota "Archaea" "Euryarchaeota" "Euryarchaeota"
ABY1_OD1 "Bacteria" "ABY1_OD1" "ABY1_OD1"

When the phylogenetic tree is available, we can use the functions merge_samples(), merge_taxa(), and tax_glom().

2.3.2.6 Operate on Sample

We can use the sample_names () function to retrieve sample names.
> head(sample_names(physeq))
[1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw" "M31Plmr"
We can use the subset_samples() function to subset the metadata table by metadata fields (variables).
> physeq_sub <- subset_samples(physeq, SampleType == "Soil")
> physeq_sub
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 3 samples ]
sample_data() Sample Data: [ 3 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
We also can subset a sample using sample names.
> # Check sample names for Soil and sample CL3 in this phyloseq object
> sub <- rownames(subset(meta(physeq), SampleType == "Soil" & X.SampleID == "CL3"))
> head(sub,3)
[1] "CL3"
We also can pick the subset using sample names.
> # Pick the phyloseq subset with these sample names
> sub2 <- prune_samples(sub, physeq)
> sub2
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 1 samples ]
sample_data() Sample Data: [ 1 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
The following commands check the total abundance in each sample.
> # Total OTU abundance in each sample
> sam_sum <- sample_sums(physeq)
> head(sam_sum)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr
864077 1135457 697509 1543451 2076476 718943
Finally, for longitudinal data, we can use the baseline() function to pick samples at the baseline.
> # Pick samples at the baseline only.
> physeq0 <- baseline(physeq)

2.3.2.7 Operate on Metadata

For downstream statistical analysis, it is very important to familiarize metadata. First, let’s use the sample_variables () function to check what sample variables are available in the metadata.
> sample_variables(physeq)
[1] "X.SampleID" "Primer"
[3] "Final_Barcode" "Barcode_truncated_plus_T"
[5] "Barcode_full_length" "SampleType"
[7] "Description"
One frequently used operation in microbiome study is to calculate alpha and beta diversities and then assign these values as new fields to metadata, which will facilitate association (regression) analysis between the microbial taxa abundance and environmental/clinical factors. Below we first calculate Shannon diversity for samples and then assign the values to the sample data of “physeq” object.
> # Calculate diversity for samples
> shannon <- microbiome::alpha(physeq, index = "shannon")
> head(shannon,3)
diversity_shannon
CL3 6.577
CC1 6.777
SV1 6.498
> # Assign the estimated Shannon diversity to sample metadata
> sample_data(physeq)$Shannon <- shannon
> head(sample_data(physeq)$Shannon,3)
diversity_shannon
CL3 6.577
CC1 6.777
SV1 6.498
As recall, we have already retrieved sample data using the meta() above. Now we can directly assign Shannon diversity to the “meta” dataframe.
> meta <- meta(physeq)
> meta$Shannon <- shannon
> head(meta,3)
X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
CL3 CL3 ILBC_01 AACGCA TGCGTT
CC1 CC1 ILBC_02 AACTCG CGAGTT
SV1 SV1 ILBC_03 AACTGT ACAGTT
Barcode_full_length SampleType Description
CL3 CTAGCGTGCGT Soil Calhoun South Carolina Pine soil, pH 4.9
CC1 CATCGACGAGT Soil Cedar Creek Minnesota, grassland, pH 6.1
SV1 GTACGCACAGT Soil Sevilleta new Mexico, desert scrub, pH 8.3
diversity_shannon
CL3 6.577
CC1 6.777
SV1 6.498

2.3.2.8 Data Transformations

The microbiome package provides a wrapper for transformation of standard sample/OTU. One syntax is given: transform(x, transform = “identify”, target = “OTU”, shift = 0, scale = 1), where x is a phyloseq-class object; transform is used to specify the method of transformation to apply including “compositional” (i.e., relative abundance), “Z,” “log10,” “log10p,” “hellinger,” “identity,” “clr,” or any method from the vegan::decostand function. The target argument is used to specify whether the transformation applies for “sample” or “OTU,” but this option does not affect the log transformation. The shift argument is used to specify a constant to shift the baseline abundance (in transform = “shift”), and the scale is used to scale constant for the abundance values (in transform = “scale”).

One often used data transformation is to transform absolute abundances to relative scale of abundances. Below we print the first three observations of absolute abundances to see what they look like.
# Transform to relative abundances
> head(otu_table(physeq),3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0 0 0 0
522457 0 0 0 0 0 0 0 0 0 0
951 0 0 0 0 0 0 1 0 0 0
------

Now we transform the absolute abundances to relative abundances. By applying the transformation of “compositional,” abundances are returned as relative abundances in [0, 1], i.e., convert to percentages by multiplying with a factor of 100.

> physeq_comp <- microbiome::transform(physeq, "compositional")
> head(otu_table(physeq_comp),3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0.000e+00 0 0 0
522457 0 0 0 0 0 0 0.000e+00 0 0 0
951 0 0 0 0 0 0 2.305e-06 0 0 0
------
The Hellinger transformation is square root of the relative abundance but instead given at the scale [0,1]. The following commands perform the Hellinger transformation.
> physeq_hellinger <- microbiome::transform(physeq, "hellinger")
> head(otu_table(physeq_hellinger),3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0.000000 0 0 0
522457 0 0 0 0 0 0 0.000000 0 0 0
951 0 0 0 0 0 0 0.001518 0 0 0
------
The syntaxes of Log10 and Log10p transforms are transform(x, ‘log10’) and transform(x, ‘log10p’), respectively. The log10p transformation always implements log10(1 + x). The log10 transformation is applied as log10(1 + x) if the data contains zeros. The following commands perform log10 transformation.
> physeq_log10 <- microbiome::transform(physeq, "log10")
> head(otu_table(physeq_log10),3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
549322 0 0 0 0 0 0 0.000 0 0 0
522457 0 0 0 0 0 0 0.000 0 0 0
951 0 0 0 0 0 0 0.301 0 0 0
------
For compositional approach of microbiome data analysis, the central log-ratio(CLR) transformation is often used. Since log zero is undefined, CLR transformation usually adds a pseudocount to avoid zeros, such as 0.5, 1, or a pseudocount of min(relative abundance)/2 to exact zero relative abundance entries in OTU table before taking logs. Some software provides an alternative method to impute the zero-inflated unobserved values. For example, the zCompositions R package has options to choose a multiplicative Kaplan-Meier smoothing spline (KMSS) replacement (by implementing the multKM() function), multiplicative lognormal replacement (via the multLN () function), or multiplicative simple replacement (via the multRepl() function). By implementing these functions, in practice, at least draw 1000 (setting n.draws = 1000).
> physeq_clr <- microbiome::transform(physeq, "clr")
> head(otu_table(physeq_clr),3)
OTU Table: [3 taxa and 26 samples]
taxa are rows
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong
549322 -1.503 -1.629 -1.247 -0.416 -0.3503 -0.6268 -0.9298 -0.6795 -0.3541
522457 -1.503 -1.629 -1.247 -0.416 -0.3503 -0.6268 -0.9298 -0.6795 -0.3541
951 -1.503 -1.629 -1.247 -0.416 -0.3503 -0.6268 1.5438 -0.6795 -0.3541
------

Other common transformations are also available in the microbiome package including “Z” and “shift.”

2.3.2.9 Export phyloseq Data into CSV Files

First, we find the slot names by implementing the slotNames() function to phyloseq object “physeq.”
> slotNames(physeq)
[1] "otu_table" "tax_table" "sam_data" "phy_tree" "refseq"
Then, we define otu_table, tax_table, and sam_data as data frames.
> otu_df = as.data.frame(physeq@otu_table)
> tax_df = as.data.frame(physeq@tax_table)
> sam_df = as.data.frame(physeq@sam_data)
Finally, we export otu_table, tax_table, and sam_data using the readr package.
> library(readr)
> # Write otu_table data to a csv file
> write_csv(otu_df, file = "otu_tab_GlobalPatterns.csv")
> # Write tax_table data to a csv file
> write_csv(tax_df, file = "tax_tab_GlobalPatterns.csv")
> # Write sam_table data to a csv file
> write_csv(sam_df, file = "sam_tab_GlobalPatterns.csv")

2.3.2.10 Import CSV, Mothur, and BIOM Format Files into a phyloseq Object

We can use the read_phyloseq() function to read the otu_table, taxonomy_table, and metadata_table with .csv, mothur, and BIOM formats. The syntax is given:
read_phyloseq(otu.file = NULL, taxonomy.file = NULL,
metadata.file = NULL, type = c(“simple”, “mothur”, “biom”), sep = “,”)
where otu.file is used to specify the file containing the OTU table (for mothur this is the file with the .shared extension); taxonomy.file is used to specify the file containing the taxonomy table (for mothur this is typically the consensus taxonomy file with the .taxonomy extension); metadata.file is used to specify the file containing samples by variables metadata; type is used to specify the input data type: either “mothur” or “simple” or “biom” type; and sep is for CSV file separator. The read_phyloseq() function returns a phyloseq-class object. Here, we illustrate read CSV files into a phyloseq-class object.
> library(microbiome)
> # Read CSV files into R
> physeq_csv <- read_phyloseq(otu.file = "otu_table_GlobalPatterns.csv",
+ taxonomy.file = "tax_table_GlobalPatterns.csv",
+ metadata.file = "metadata_GlobalPatterns.csv",
+ type= "simple")
> physeq_csv
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 8 taxonomic ranks ]

2.3.3 ampvis2

The R package ampvis2 was developed by Andersen et al. to analyze and visualize 16S rRNA amplicon data (Andersen et al. 2018) (current version 2.7.17, March 2022). Like the phyloseq and microbiome packages, ampvis2 mainly wraps statistical functions from vegan package and ggplot2 graphical package. The attractiveness of ampvis2 is that its syntax of functions is very simple, and their plots such as ordination have very visualized effects. We will introduce ampvis2 to calculate alpha diversity (in Chap. 9), to calculate beta diversity, and to perform ordination plots (in Chap. 10).

2.3.4 curatedMetagenomicData

Shotgun metagenomic sequencing has provided amount of data for studying the taxonomic composition and functional potential of the human microbiome. However, various challenges prevent researchers from taking full advantage of these resources. The Bioconductor package curatedMetagenomicData (Pasolli et al. 2017) was developed to overcome these challenges and help the more shotgun data available publicly for using to perform hypothesis testing and meta-analysis (current version 3.0.10, March, 2022). The curatedMetagenomicData data package was developed for distribution through the Bioconductor (Huber et al. 2015) ExperimentHub platform. The database of curatedMetagenomicData so far provides thousands of uniformly processed and manually annotated human microbiome profiles (Pasolli et al. 2017), including bacterial, fungal, archaeal, and viral taxonomic abundances and gene marker presence and absence from MetaPhlAn2 (Truong et al. 2015), in addition to quantitative metabolic functional profiles and standardized metadata, including coverage and abundance of metabolic pathways and gene families abundance from HUMAnN2 (Abubucker et al. 2012). Metagenomic data with matched health and socio-demographic data are provided as Bioconductor ExpressionSet objects, with options for automatic conversion of taxonomic profiles to phyloseq or metagenomeSeq objects for microbiome-specific analyses. In this book, we will not implement this package; the interested readers are referred to (Lucas Schiffer and Waldron 2021).

2.4 Some R Packages for Analysis of Phylogenetics

Evolutionary analysis (analyses of phylogenetic trees) provides insights of phylogenetics and evolution that facilitate comparative biology and microbiome study. In this section, we introduce three phylogenetic R packages: ape, phytools, and castor. Each package provides various functions for writing, reading, plotting, and manipulating, and even inferring phylogenetic trees. However, we focus on phylogenetic tree generation, writing, and reading and provide simple tree plotting. For building phylogenetic tree in QIIME 2, the reader is referred to Chap. 5 (Sect. 5.​2).

Example 2.7: Diet Swap Between Rural and Western Populations, Example 2.4, Cont.

In Sect. 2.3.1.1, we have created a phyloseq object using Example 2.4 (Diet swap between rural and western populations). Here we continue to use this dataset to illustrate the analyses of phylogenetics.

> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 37 taxa and 222 samples ]
sample_data() Sample Data: [ 222 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 37 taxa by 3 taxonomic ranks ]

2.4.1 Ape

The R package ape stands for Analyses of Phylogenetics and Evolution and is a modern and useful software for the study of biodiversity and evolution (Paradis and Schliep 2018)(current version 5.6.2, March 2022).

The ape package provides functions for reading, writing, plotting, and manipulating phylogenetic trees, conducting analyses of comparative data in a phylogenetic framework, ancestral character, diversification and macroevolution, computing distances from DNA sequences, reading and writing nucleotide sequences, as well as performing other functionality such as Mantel’s test, generalized skyline plots. Here, we illustrate how to create and write a random tree using the function rtree(). This function generates a tree by splitting randomly the edges until the specified number of tips is obtained. One syntax of the rtree() is given:
rtree(n, rooted = TRUE, tip.label = NULL, br = runif, equiprob = FALSE)
where the argument n is an integer giving the number of tips in the tree; rooted is a logical value that is used to indicate whether the tree should be rooted (the default); tip.label is a character vector that is used to specify the tip labels with the tips “t1,” “t2,” ..., being given if not specified. The argument br is used to specify an R function used to generate the branch lengths; use NULL to simulate only a topology. The argument equiprob (new since ape version 5.4.1) is a logical value that is used for specifying whether topologies are generated in equal frequencies. If equiprob = FALSE is specified, then the unbalanced topologies are generated in higher proportions than the balanced ones. The function rtree() returns an object of class “phylo” (Fig. 2.8).
  • Step 1: Generate random rooted and unrooted trees.

A schema of the phylogenetic tree depicts the topologies generated with equal frequencies, then the unbalanced topologies are generated in higher proportions than the balanced ones.

Fig. 2.8

A random phylogenetic tree generated by the ape package using the Diet swap between rural and western populations

library("ape")
> # Generate a random rooted tree
random_tree = rtree(ntaxa(physeq), rooted=TRUE, tip.label=taxa_names(physeq))
> # Generate a random unrooted tree
> random_unrooted_tree = rtree(ntaxa(physeq), rooted=FALSE, tip.label=taxa_names(physeq))
  • Step 2: Plot the random trees.

> # Figure 2.8
> # Plot this random tree
plot(random_tree)
  • Step 3: Write the phylogenetic trees.

The generated tree can be easily written to and read back from files. The following commands write a rooted and an unrooted phylogenetic tree using the function write.tree():

> # Write the rooted random tree
> write.tree(random_tree, "Rooted_tree_dietswap.tre")
> # Write the unrooted random tree
> write.tree(random_unrooted_tree, "Unrooted_tree_dietswap.tre")
  • Step 4: Read back the phylogenetic trees.

The following commands read back a rooted and an unrooted phylogenetic tree using the function readLines ():
> cat(readLines("Rooted_tree_dietswap.tre"))
> cat(readLines("Unrooted_tree_dietswap.tre"))

2.4.2 Phytools

The R package phytools (Liam J. Revell 2022; Liam J Revell 2012) developed by Liam J. Revell is phylogenetic tools for comparative biology (and other things) (current Version 1.0–1, January, 2022). It provides functions in phylogenetic comparative biology and numerous methods for visualizing, manipulating, reading or writing, and even inferring phylogenetic trees.
  • Step 1: Generate random rooted and unrooted trees.

In Sect. 2.4.1, we have generated a random tree named as “random_tree” using the following commands (Fig. 2.9).

A schema of a rooted tree depicts phylogenetic comparative biology.

Fig. 2.9

The rooted phylogenetic tree plotted by the plotTree () function via the phytools package using dietswap data

> # Generate a random rooted tree
> random_tree = ape::rtree(ntaxa(physeq), rooted=TRUE, tip.label=taxa_names(physeq))
  • Step 2: Write the phylogenetic trees

The following commands write out this tree using the function writeNexus() from the phytools package.

This function writes a tree to file in Nexus format. It is redundant with ape::write.nexus(). The syntax is writeNexus(tree, “file = “”), where the argument tree is the object of class “phylo” or “multiPhylo”;

file is the file name for output.

> library(phytools)
> library(ape)
> library(maps)
> writeNexus(random_tree, "Rooted_tree_dietswap.nex")
  • Step 3: Read back the phylogenetic trees.

The following commands read trees from files in the directory.
> cat(readLines("Rooted_tree_dietswap.nex"), sep = "\n")
  • Step 4: Plot the phylogenetic trees.

We can use the plotTree() function to plot rooted phylogenetic tree (a rooted phylogram).
> Figure 2.9
> plotTree(random_tree)

2.4.3 Castor

The class “phylo” in phylogenetic tree is the most important data unit. It was introduced in ape in August 2002. The class “phylo” with the tree topology encoded in the member variable tree.edge. Most existing phylogenetic packages have been designed using the “phylo” format for much smaller trees containing at most a few thousand tips, and thus scale poorly to large datasets (Louca and Doebeli 2017). Unlike these existing phylogenetic packages, the R package castor (Louca and Doebeli 2017) was developed to compare phylogenetics on large phylogenetic trees (>10,000 tips)(current version 1.7.2, March, 2022). It provides efficient implementations of common phylogenetic functions, focusing on analysis of trait evolution on fixed trees. This package provides four kinds of functions: (1) efficient tree manipulation functions including pruning, rerooting, and calculation of most-recent common ancestors; (2) calculating distances from the tree root and pairwise distance matrices; (3) calculating phylogenetic signal and mean trait depth (trait conservatism); and (4) providing efficient ancestral state reconstruction and hidden character prediction of discrete characters on phylogenetic trees using maximum likelihood and maximum parsimony methods. Below we introduce how to  write and read a tree in Newick (parenthetic) format using this package. For more information about Newick format, see Chap. 3 (Sect. 3.​1.​4).

To install the latest version of this package, enter the following commands in R or RStudio.
> install.packages("castor")
> library(castor)
  • Step 1: Write the phylogenetic trees.

The write_tree() function writes a phylogenetic tree to a file or a string, in Newick (parenthetic) format. If the tree is unrooted, it is first rooted internally at the first node. The syntax of this function is given:

write_tree (tree, file = “”, append = FALSE, digits = 10, quoting = 0, include_edge_labels = FALSE,
include_edge_numbers = FALSE)

The arguments: tree is a tree of class “phylo”; file is an optional path to a file, to which the tree should be written. The file may be overwritten without warning. If left empty (default), then a string is returned representing the Newick format tree; otherwise, the tree is directly written to the file. The argument append is a logical value that is used for specifying whether the tree should be appended at the end of the file, rather than replacing the entire file (if it exists). digits is an integer that is used to specify the number of significant digits for writing edge lengths; quoting is an integer, specifying whether and how to quote tip/node/edge names with 0, no quoting at all; 1, always use single quotes; 2, always use double quotes; −1, only quote when needed and prefer single quotes if possible; −2, only quote when needed and prefer double quotes if possible. The arguments include_edge_labels and include_edge_numbers are the logical values; they are used for specifying whether to include edge labels (if available) in the output tree, inside square brackets, edge numbers (if available) in the output tree, inside curly braces, respectively. These two arguments are an extension (Matsen et al. 2012) to the standard Newick format. This function write_tree() is comparable to (but typically much faster than) the ape function write.tree().

# Obtain a string representation of the tree in Newick format
> Newick_string = write_tree(random_tree)
> Newick_string
  • Step 2: Read back a tree from a string or file.

The read_tree() function loads a tree from a string or file in Newick (parenthetic) format. The syntax is given:

read_tree( string = “”, file = “”, edge_order = “cladewise”, include_edge_lengths = TRUE, look_for_edge_labels = FALSE, look_for_edge_numbers = FALSE, include_node_labels = TRUE, underscores_as_blanks = FALSE, check_label_uniqueness = FALSE, interpret_quotes = FALSE, trim_white = TRUE)
where string is a character containing a single tree in Newick format; file is character that is used to specify a path to an input text file containing a single tree in Newick format. The argument edge_order is either “cladewise” or “pruningwise,” specifying the order in which edges should be listed in the returned tree. For other arguments of read_tree(), the readers are referred to the manual of the castor package.
> # Re-parse tree from string
> parsed_tree = read_tree(Newick_string)
> parsed_tree
Phylogenetic tree with 37 tips and 36 internal nodes.
Tip labels:
Anaerobiospirillum, Micrococcaceae, Haemophilus, Aneurinibacillus, Aquabacterium, Lactococcus, ...
Rooted; includes branch lengths.

2.5 BIOM Format and Biomformat Package

The Biological Observation Matrix (BIOM) format(McDonald et al. 2012) or the BIOM file format (canonically pronounced biome) was designed to be a general-use format for representing biological sample by observation contingency tables (i.e., OTU tables, metagenome tables, or a set of genomes). BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium supported project. See more information on BIOM format in Chap. 3 (Sect. 3.​1.​3).

The biomformat package (McMurdie and Paulson 2021) is a utility package. Its use depends on other packages. It provides I/O functionality and functions to facilitate data processing from BIOM format files, such as providing tools to access data in “data.frame,” “matrix,” and “Matrix” classes.

To install this package, start R (version 4.1) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomformat")
> library("biomformat");
packageVersion("biomformat")
[1] ‘1.20.0’
  • Step 1: Read BIOM format file into R.

The following commands import BIOM formats of otu-table into R using the function read_biom ().
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> otu_tab_biom <- "seq_table_L7.biom"
> biom_obj <- read_biom(otu_tab_biom)
> biom_obj
biom object.
type: OTU table
matrix_type: sparse
637 rows and 82 columns
The contents of BIOM data are briefly summarized above.
  • Step 2: Access BIOM data.

The biomformat package includes accessor functions. We can use them to access the data with BIOM format files.

Access Core Observation Data

The core “observation” data is stored in either sparse or dense matrices with the BIOM format. The sparse matrix support is carried with R via the Matrix package. We can use the biom_data() function to access the BIOM format matrices.

> head(biom_data(biom_obj),3)
3 x 82 sparse Matrix of class "dgCMatrix"
[[ suppressing 82 column names ‘Sun071.PG1’, ‘Sun027.BF2’, ‘Sun003.BA3’ ... ]]
D_0__Archaea;D_1__Thaumarchaeota;D_2__Nitrososphaeria;D_3__Nitrososphaerales;D_4__Nitrososphaeraceae;Other;Other .
D_0__Bacteria;D_1__Acidobacteria;D_2__Acidobacteriia;D_3__Solibacterales;D_4__Solibacteraceae (Subgroup 3);D_5__Bryobacter;Other .
D_0__Bacteria;D_1__Acidobacteria;D_2__Acidobacteriia;D_3__Solibacterales;D_4__Solibacteraceae (Subgroup 3);Other;Other

Obviously the accessed results from the fancier sparse Matrix of class are not easy to read. We can use the as() function to easily coerce the data to the simple, standard “matrix” class.

> head(as(biom_data(biom_obj), "matrix"),3)
Sun071.PG1 Sun027.BF2 Sun003.BA3
D_0__Archaea;D_1__Thaumarchaeota;D_2__Nitrososphaeria;D_3__Nitrososphaerales;D_4__Nitrososphaeraceae;Other;Other 0 0 0
D_0__Bacteria;D_1__Acidobacteria;D_2__Acidobacteriia;D_3__Solibacterales;D_4__Solibacteraceae (Subgroup 3);D_5__Bryobacter;Other 0 0 0
D_0__Bacteria;D_1__Acidobacteria;D_2__Acidobacteriia;D_3__Solibacterales;D_4__Solibacteraceae (Subgroup 3);Other;Other 0 0 0
Access Observation Metadata

Observation metadata is metadata associated with the individual units being counted/recorded in a sample. For microbiome census data, it is often a taxonomic classification and something about a particular OTU/species. Here, the observations are counts of OTUs and the metadata is taxonomic classification, if present. In this case, the absence of observation metadata is reported. We can access observation metadata using the observation_metadata () function.

> observation_metadata(biom_obj)
NULL
Access Sample Metadata

Sample metadata describe the properties of the samples. Similarly, we can access sample metadata using the sample_metadata() function.

> sample_metadata(biom_obj)
NULL
  • Step 3: Write BIOM format data.

The following commands write the biom objects to a temporary file using the write_biom () function and then read it back using the read_biom () function and store it as variable infile. Next it checks whether these two objects are identical using the identical() function.

> outfile_biom = tempfile()
> write_biom(biom_obj, outfile_biom )
> infile = read_biom(outfile_biom)
> identical(biom_obj, infile)
[1] FALSE
  • Step 4: Perform visualization and analysis on BIOM data.

Since the BIOM format files can be accessed, we can visualize and analyze these data using other R functions such as plots.
> plot(biom_data(biom_obj))
> boxplot(as(biom_data(biom_obj), "vector"))
> heatmap(as(biom_data(biom_obj), "matrix")

2.6 Creating Analysis Microbiome Dataset

Example 2.8: Vaginal Microbiome Study

DiGiulio et al. (2015) conducted a longitudinal vaginal microbiome study to investigate the bacterial taxonomic composition for pregnant and postpartum women. This case-control study included 49 pregnant women, 15 of whom delivered preterm. The discovery data of 3767 specimens from 40 women were collected prospectively and weekly during gestation and monthly after delivery from the vagina, distal gut, saliva, and tooth/gum. The final dataset for pregnant women who delivered at term and preterm contained a total of 1271 taxa from 3432 specimens. Three datasets were downloaded and named as “TemporalSpatialOTU.csv,” “TemporalSpatialTaxonomy.csv,” and “TemporalSpatialClinicalMerged.csv,” respectively. The OTU data are available at species level. The taxonomy data are ranked at 7 levels: Kingdom, Phylum, Class, Order, Family, Genus, and Species. The available clinical information included race, weeks/days at the collection of samples, way of delivery, and household income level. The clinical data of 9 pregnant women in validated dataset is not available.

The study found that those women who with microbiome profiles had high abundances of Lactobacillus were less likely to experience preterm births, defined as delivery before gestational weeks. The study also found that those women who had high amounts of Prevotella had a higher occurrence of preterm births.

These datasets have been used by others to illustrate their proposed models including Zhang et al. (Zhang et al. 2018). In this section, we illustrate how to create a microbiome dataset that is suitable for longitudinal data analysis. We illustrate the dataset derivation by the following 6 steps.
  • Step 1: Load raw microbiome datasets.

We first use read.csv() function to load the three raw pregnancy microbiome datasets including OTU, taxonomy, and clinical data.
> setwd("~/Documents/QIIME2R/Ch2_IntroductionToR")
> otu_tab <- read.csv("TemporalSpatialOTU.csv", row.names = 1,check.names = F)
> tax_tab <- read.csv("TemporalSpatialTaxonomy.csv", row.names = 1)
> sam_tab <- read.csv("TemporalSpatialClinicalMerged.csv", row.names = 1)
  • Step 2: Prune taxa using the gsub() function, and string pattern matching and replacement.

The goal of pruning taxa is to make data more readable. The original names of taxonomy are split by “:” and “-”. We want to replace them with “_”.
> head(tax_tab,3)
Kingdom Phylum Class Order Family Genus Species
4330849 Bacteria P:SR1 P:SR1 P:SR1 P:SR1 P:SR1 P:SR1
4400869 Bacteria P:SR1 P:SR1 P:SR1 P:SR1 P:SR1 P:SR1
4457086 Bacteria P:GN02 C:BD1-5 C:BD1-5 C:BD1-5 C:BD1-5 C:BD1-5
We can use the functions apply() and gsub() as well as the string pattern matching and replacement techniques to replace “:” and “-” with “_” (Xia et al. 2018).
> tax_tab <- apply(tax_tab, 2, function(x) gsub(':|-', '_', x))
> head(tax_tab,3)
Kingdom Phylum Class Order Family Genus Species
4330849 "Bacteria" "P_SR1" "P_SR1" "P_SR1" "P_SR1" "P_SR1" "P_SR1"
4400869 "Bacteria" "P_SR1" "P_SR1" "P_SR1" "P_SR1" "P_SR1" "P_SR1"
4457086 "Bacteria" "P_GN02" "C_BD1_5" "C_BD1_5" "C_BD1_5" "C_BD1_5" "C_BD1_5"
  • Step 3: Use the phyloseq and Tidyverse packages to filter and subset taxonomy and sample data.

Combining the phyloseq and tidyverse packages is an effective way to filter and subset microbiome data. We first create a phyloseq object using the phyloseq package, and then process the filtering and subsetting with tidyverse package.
> library(phyloseq)
> OTU <- otu_table(otu_tab, taxa_are_rows = FALSE)
> TAX <- tax_table(as.matrix(tax_tab))
> SAM <- sample_data(sam_tab)
> table(SAM$Preg,SAM$BodySite)
Saliva Stool Tooth_Gum Vaginal_Swab
FALSE 167 152 147 166
TRUE 740 580 719 761
> table(SAM$Outcome,SAM$BodySite)
Saliva Stool Tooth_Gum Vaginal_Swab
Marginal 108 92 110 111
Preterm 82 42 82 82
Term 668 557 623 684
VeryPreterm 49 41 51 50
> # Genus level
> library(tidyverse)
> # Keep only samples collected during pregnancy
> # Keep only samples in vaginal swab
> genus <- phyloseq(OTU, TAX, SAM) %>%
+ tax_glom(taxrank = 'Genus') %>% # Combine taxa at genus level
+ subset_samples(Outcome != 'Marginal') %>%# Same as in the original paper
+ subset_samples(BodySite == 'Vaginal_Swab') %>% # Look only at vaginal samples
+ subset_samples(GDColl <= GDDel) # Remove samples collected after delivery
> genus
phyloseq-class experiment-level object
otu_table() OTU Table: [ 269 taxa and 678 samples ]
sample_data() Sample Data: [ 678 samples by 64 sample variables ]
tax_table() Taxonomy Table: [ 269 taxa by 7 taxonomic ranks ]
From the above phyloseq object, we can see there are 678 samples with 64 sample variables. The original paper analyzed the abundance data at species level and divided the samples into 5 Vaginal Community State Types (CSTs) and only analyzed samples with a Lactobacillus-poor vaginal community state type (CST 4). Here, for the illustration of data derivation, we averaged the abundance data at genus level and did not use the complete same criteria for sample filtering.
  • Step 4: Use the Tidyverse and dplyr packages to finalize analysis dataset.

There are 64 sample variables in genus sample data. We can check the variable names using colnames(sample_data(genus)) or head(sample_data(genus)).

> colnames(sample_data(genus))
[1] "SampleID"
[2] "BarcodeSequence"
[3] "LinkerPrimerSequence"
[4] "BodySite"
[5] "SubjectID"
------
The following procedure is used to create or rename the variables. This is easy to do using tidyverse package combining with the mutate() function.
> library(tidyverse)
> # Final metadata
> meta_tab <- data.frame(sample_data(genus)) %>%
+ filter (GDColl >= 0) %>%
+ mutate(Weeks = GWColl) %>%
+ mutate(N = NumReads) %>% # Total reads: some models also use LibrarySize
+ mutate(Subject = SubjectID) %>%
+ mutate(Preterm = (Outcome != 'Term')) %>%
+ mutate(Delivery = 1) %>%
+ mutate(TrimColl = factor(TrimColl))
> dim(meta_tab)
[1] 678 68

Finally, for ease of processing, we want to remove the variables that we do not use in the analysis. Although the tidyverse package contains the dplyr package, sometimes they are masked. In this case, we explicitly call the select() function of dplyr using the function dplyr::select().

> # Obtain final meta dataset using the setect() function from the dplyr package
> head(meta_genus<-dplyr::select(meta_tab,Subject,BodySite, GDColl,GWColl,GDDel,GWDel,
+ Weeks,N,NumReads,TrimColl,Race,Ethnicity,LibrarySize,PrePreg,Preg,Term,Preterm,Outcome),3)
Subject BodySite GDColl GWColl GDDel GWDel Weeks N NumReads TrimColl
1 10003 Vaginal_Swab 198 29 260 38 29 2341 2341 3
2 10003 Vaginal_Swab 205 30 260 38 30 1136 1136 3
3 10003 Vaginal_Swab 212 31 260 38 31 2344 2344 3
Race Ethnicity LibrarySize PrePreg Preg Term Preterm Outcome
1 American Indian Non-hispanic 2347 FALSE TRUE TRUE FALSE Term
2 American Indian Non-hispanic 1144 FALSE TRUE TRUE FALSE Term
3 American Indian Non-hispanic 2356 FALSE TRUE TRUE FALSE Term
> colnames(meta_genus)
[1] "Subject" "BodySite" "GDColl" "GWColl" "GDDel"
[6] "GWDel" "Weeks" "N" "NumReads" "TrimColl"
[11] "Race" "Ethnicity" "LibrarySize" "PrePreg" "Preg"
[16] "Term" "Preterm" "Outcome"
> dim(meta_genus)
[1] 678 18
> dim(otu_genus)
[1] 678 269
> dim(meta_genus)
[1] 678 18
> tax_genus<-as.data.frame(tax_table(genus))
> dim(tax_genus)
[1] 269 7
  • Step 5: Use the Readr package to write the final analysis datasets to csv and tsv files.

> # Write otu_genus data to a csv file
> write_csv(otu_genus, file = "otu_genus.csv")
> # Write tax_genus data to a csv file
> write_csv(tax_genus, file = "tax_genus.csv")
> # Write meta_genus data to a csv file
> write_tsv(meta_genus, file = "meta_genus.csv")
> # Write meta_genus data to a tsv file
> write_tsv(meta_genus, file = "meta_genus.tsv")

For some readers, this example may be too complicated at the beginning chapter. Do not worry, you can skip these materials. When you read further in the later chapters, you can go back to look at these programs again; you will find that they are very useful.

2.7 Summary

Chapter 2 described and introduced some useful R functions, R packages, specifically designed R packages for microbiome data, and some R packages for analysis of phylogenetics, as well as BIOM format and biomformat package. The R programming language was created by statisticians Ross Ihaka and Robert Gentleman for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R first appeared in August 1993 and the stable version 4.1.2 was released on November 1, 2021. R is available as open-source free software and is getting popular in academic research for statistical analysis and computing, and specifically is a main tool for development of statistical methods and models for microbiome and other omics studies. This chapter provided 6 sections on the R functions and packages for microbiome data analysis, including (1) useful R functions for saving, reading, and downloading data; (2) R utility and data visualization packages; (3) specifically designed R packages for microbiome data with focusing on the phyloseq and microbiome packages; (4) some R packages for analysis of phylogenetics; (5) BIOM format and biomformat package; and (6) creating analysis microbiome dataset. In Chap. 3, we will introduce some basic data processing operations in QIIME 2.